Philadelphia has experienced significant demographic and economic transformations over recent decades, leading to notable implications for its urban housing market. These shifts have resulted in variations in median house values, which serve not only as a reflection of the city’s economic health but also as a proxy for broader social and spatial dynamics. Increasing median house values may indicate an influx of higher-income residents or early stages of gentrification, whereas declining values can be symptomatic of disinvestment and economic decline. Given these dynamics, accurately forecasting median house values is vital for urban planners and policymakers who are tasked with promoting sustainable and equitable urban development.
In our previous study, Ordinary Least Squares (OLS) regression was used to explore the relationships between median house values, the dependent variable, and several key socio-economic predictors in Philadelphia. These predictors included educational attainment, vacancy rates, the proportion of detached single-family homes, and the poverty rate. All of these factors influenced home values in different ways.
Although OLS regression provides a foundational understanding of the relationships between the predictors and the dependent variable, it has limitations when applied to spatial data. One of the key assumptions of OLS regression is that observations are independent of each other, with no spatial autocorrelation. However, in spatial data, observations that are geographically close often exhibit similar values, leading to spatial autocorrelation and violating this assumption. Violating the assumptions of OLS may lead to biased and inefficient estimates of the regression coefficients and incorrect inferences about the relationships between the predictors and the dependent variable.
To address these limitations, this report employs advanced spatial regression techniques to predict median house values in Philadelphia. We use Spatial Lag Regression, Spatial Error Regression, and Geographically Weighted Regression (GWR) to account for spatial autocorrelation and spatial heterogeneity in the data. We examine whether these spatial regression models can predict home values more accurately than the Ordinary Least Squares (OLS) regression model. By utilizing these spatial techniques, this study aims to improve on the accuracy of the initial OLS findings and provide a more comprehensive understanding of the socio-economic and spatial factors influencing housing values. These insights will support more effective policy interventions and urban development strategies aimed at achieving equitable and sustainable growth in Philadelphia.
Spatial autocorrelation describes the degree to which a variable is correlated with itself across space. It shows the relationship of values within a single variable at nearby locations, helping in understanding patterns of spatial distribution and identifying clusters or dispersions in spatial data. The concept of spatial autocorrelation is rooted in The First Law of Geography, which states:
“Everything is related to everything else, but near things are more related than distant things.”
This principle suggests that geographically proximate areas tend to exhibit similar characteristics due to shared environmental, economic, or social factors.
Spatial autocorrelation measures how much a variable in one location is influenced by values in nearby locations. If observations that are closer to each other in space have similar values, spatial autocorrelation is positive. Conversely, if observations that are closer to each other have markedly different values, spatial autocorrelation is negative.
Moran’s I is a widely used method for measuring spatial autocorrelation. The formula for Moran’s I is:
\[ I = \frac{N}{\sum_{i} \sum_{j} w_{ij}} \times \frac{\sum_{i} \sum_{j} w_{ij} (X_i - \bar{X}) (X_j - \bar{X})}{\sum_{i} (X_i - \bar{X})^2} \]
where \(N\) is the number of observations, \(w_{ij}\) is the spatial weight between locations \(i\) and \(j\), \(X_i\) and \(X_j\) are the values of the variable at locations \(i\) and \(j\), and \(\bar{X}\) is the mean of the variable.
A Moran’s I value close to +1 indicates strong positive spatial autocorrelation (clusters of similar values). A value near -1 suggests strong negative spatial autocorrelation (dispersion). A value near 0 implies no spatial autocorrelation.
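To make the formula concrete, here is a small illustrative computation. The analysis in this report is carried out in R; this is a Python sketch on hypothetical toy data (four locations on a line, adjacent locations treated as neighbors).

```python
import numpy as np

def morans_i(x, W):
    """Moran's I: (N / sum_ij w_ij) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2."""
    z = x - x.mean()
    return (len(x) / W.sum()) * (W * np.outer(z, z)).sum() / (z ** 2).sum()

# Toy data: similar values sit next to each other, so we expect positive autocorrelation
x = np.array([1.0, 2.0, 8.0, 9.0])
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

print(morans_i(x, W))  # 0.4 -> positive spatial autocorrelation
```

Because the low values (1, 2) and the high values (8, 9) are adjacent, the cross-products of deviations for neighboring pairs are mostly positive, giving a positive Moran's I.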
When dealing with spatial data, we use spatial weight matrices to define relationships between observations. Given \(n\) observations, we construct an \(n \times n\) matrix that summarizes all pairwise spatial relationships in the dataset. These matrices are essential for estimating spatial regression models and calculating spatial autocorrelation indices.
There are several ways to define spatial relationships within a weight matrix. Queen Contiguity Matrix assigns a weight of 1 if two regions share a border or a vertex, otherwise 0. Rook Contiguity Matrix assigns a weight of 1 if two regions share only a border, otherwise 0. Distance-based Matrix assigns weights based on the inverse distance between observations.
In this report, we use the Queen contiguity weight matrix, which considers all neighboring regions that share either a boundary or a vertex.
Although we only use the queen contiguity weight matrix in this report, statisticians often use multiple spatial weight matrices to check the robustness of their results. Because different spatial weights capture spatial dependencies at different levels of granularity, this practice helps ensure that the results are not merely an artifact of the chosen matrix.
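To illustrate how the contiguity definitions differ, the sketch below builds queen and rook matrices for a hypothetical regular grid of square cells (Python, for illustration; the report builds its matrix from Philadelphia's block-group polygons in R).

```python
import numpy as np

def grid_weights(nrow, ncol, scheme="queen"):
    """Binary contiguity matrix for an nrow x ncol grid of square cells."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    # rook: shared edge only; queen: shared edge or vertex
                    if scheme == "rook" and dr != 0 and dc != 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < nrow and 0 <= cc < ncol:
                        W[r * ncol + c, rr * ncol + cc] = 1
    return W

# The centre cell of a 3x3 grid has 8 queen neighbors but only 4 rook neighbors
print(grid_weights(3, 3, "queen")[4].sum())  # 8.0
print(grid_weights(3, 3, "rook")[4].sum())   # 4.0
```

The diagonal neighbors are exactly what the queen scheme adds over the rook scheme, which is why the two matrices can yield different autocorrelation estimates.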
To determine whether spatial autocorrelation is statistically significant, we conduct a hypothesis test:
Null Hypothesis (\(H_0\)): No spatial autocorrelation, meaning that the spatial distribution of values follows a random pattern with no systematic clustering or dispersion. Each location’s value is independent of the values at neighboring locations.
Alternative Hypothesis 1 (\(H_{a1}\)): Positive spatial autocorrelation, meaning that similar values tend to cluster together. High values are surrounded by other high values, and low values are surrounded by other low values, forming distinct spatial patterns.
Alternative Hypothesis 2 (\(H_{a2}\)): Negative spatial autocorrelation, meaning that similar values tend to disperse rather than cluster. High values are surrounded by low values and vice versa, leading to a checkerboard-like spatial distribution.
To test significance, we use random shuffling. First, we randomly shuffle the variable values across spatial locations many times (999 permutations are used in this report). Then, we compute Moran’s I for each permuted dataset to generate a reference distribution. Finally, we compare the observed Moran’s I to this distribution to determine whether it is extreme, which tells us whether the observed clustering pattern is statistically meaningful rather than occurring by chance.
If the observed Moran’s I falls in the extreme tail of the simulated distribution, we reject the null hypothesis (H₀) in favor of the appropriate alternative hypothesis. A p-value less than 0.05 typically indicates significant spatial autocorrelation.
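The permutation procedure can be sketched as follows. This is a Python illustration on synthetic chain data; in the report the equivalent test is run in R with 999 permutations.

```python
import numpy as np

def morans_i(x, W):
    z = x - x.mean()
    return (len(x) / W.sum()) * (W * np.outer(z, z)).sum() / (z ** 2).sum()

# Toy data: 20 locations along a chain with a strongly clustered (monotone) pattern
n = 20
x = np.arange(n, dtype=float)
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1

rng = np.random.default_rng(42)
observed = morans_i(x, W)
perms = np.array([morans_i(rng.permutation(x), W) for _ in range(999)])

# Pseudo p-value: how often a permuted statistic is at least as extreme as the observed one
p = (1 + (perms >= observed).sum()) / (999 + 1)
print(observed, p)  # observed is strongly positive; p = 0.001 with this seed
```

Because the monotone arrangement is essentially the most clustered ordering possible, no random permutation matches it, and the pseudo p-value hits its floor of 1/(999 + 1).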
While global Moran’s I provides a single statistic for the entire study area, Local Indicators of Spatial Association (LISA) provides insights into the presence of spatial autocorrelation at individual locations.
To determine whether local spatial autocorrelation is statistically significant, we conduct a hypothesis test:
Significance tests for local Moran’s I are conducted using random shuffling to ensure that detected clusters are not merely due to random chance. This process follows the same approach as global Moran’s I but involves randomly reshuffling the values of the variable across the study area while keeping the value at location \(i\) constant. By comparing the observed local Moran’s I to the distribution of values from these random permutations, statistical significance can be assessed.
If the observed \(I_i\) is extremely high or low compared to the reshuffled values, it is considered significant. The pseudo p-value is estimated from the rank of the actual \(I_i\) among the permutations. For instance, if the original \(I_i\) ranks as the 97th highest among 999 permutations, the estimated pseudo p-value is \(p \approx 0.097\).
To analyze the relationship between socioeconomic factors and median house values in Philadelphia, we often use OLS (Ordinary Least Squares) Regression. By examining these relationships, we aim to identify critical predictors of median housing values throughout Philadelphia and offer insights for decision-makers and community initiatives. The key assumptions of OLS regression include:
Linearity assumes that the relationship between the dependent variable and the predictors is linear.
Independence of Observations assumes that the observations are independent of each other. There should be no spatial, temporal, or other forms of dependence in the data.
Homoscedasticity assumes that the variance of the residuals \(\epsilon\) is constant regardless of the values of each level of the predictors.
Normality of Residuals assumes that the residuals are normally distributed.
No Multicollinearity assumes that the predictors are not highly correlated with each other.
No Fewer than 10 Observations per Predictor requires that there are at least 10 observations for each predictor in the model.
In our first assignment, we used OLS regression to assess how vacancy rates, single-family housing percentage, educational attainment, and poverty rates influence median house values in Philadelphia. All predictors were statistically significant. The model’s R-squared was 0.66, indicating that the model explains 66% of the variance in house values.
However, some predictors exhibited non-linear patterns, and spatial autocorrelation suggested dependence among observations. One of the vital assumptions of OLS regression is that observations are independent of each other. In spatial data, observations that are geographically close often exhibit similarity, leading to spatial autocorrelation and violating the independence assumption. When spatial autocorrelation is present, values of a variable in nearby areas are related rather than randomly distributed. We need to further test spatial autocorrelation and the key assumptions of OLS regression in order to improve the model’s accuracy and reliability.
Furthermore, when data has a spatial component, the assumption of normality of residuals often fails to hold. In some cases, spatial autocorrelation does not significantly impact regression analysis. If the dependent variable exhibits strong spatial autocorrelation while the error term does not, the regression coefficients and significance levels remain valid. Additionally, if both the dependent and independent variables share an identical spatial pattern, and the spatial dependencies in the dependent variable are fully explained by those in the independent variable, the residuals may be spatially independent. However, this is not always the case, and it is essential to test for spatial autocorrelation in residuals to ensure the validity of the model.
To test this assumption, spatial autocorrelation of the residuals can be examined using Moran’s I, which measures whether residuals are clustered, dispersed, or randomly distributed in space. As described above, we first extract the residuals and define a spatial weights matrix (e.g., Queen or Rook contiguity). Then, Moran’s I is computed to measure the degree of clustering in the residuals, with values close to +1 indicating positive spatial autocorrelation, -1 indicating negative autocorrelation, and 0 suggesting randomness.
Another method to test for spatial autocorrelation in OLS residuals
is to regress them on the residuals from nearby
observations. In this report, nearby residuals refer to
residuals from neighboring block groups, as defined by the Queen matrix.
The regression line between the residuals, OLS_RESIDU and
WT_RESIDU (weighted residuals from neighboring groups),
helps identify any spatial autocorrelation. The slope
(b) of this regression represents the strength of spatial
dependence. It is calculated by estimating the relationship between the
residuals of one observation and those of its neighbors.
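The idea can be sketched numerically. This Python illustration uses synthetic residuals on a hypothetical chain of locations; in the report the same regression is run in R on OLS_RESIDU and WT_RESIDU.

```python
import numpy as np

n = 100
# Row-standardized chain weights: each location's lagged value is the
# average of its adjacent locations
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1
W = W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(7)
# Synthetic spatially autocorrelated residuals: a smooth trend plus noise
resid = np.sin(np.arange(n) / 8) + rng.normal(0, 0.1, n)

lagged = W @ resid                    # neighbor-average residuals (the WT_RESIDU analogue)
b = np.polyfit(lagged, resid, 1)[0]   # slope of resid ~ lagged
print(b)  # clearly positive -> spatial dependence in the residuals
```

A slope near zero would indicate spatially independent residuals; the strongly positive slope here reflects the smooth spatial trend built into the synthetic residuals.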
In R, there are methods to test the other key assumptions as well. We will continue using R for the analysis.
Another key assumption is Homoscedasticity, which assumes that the variance of the errors (residuals) remains constant across all levels of the independent variables. In R, we used the Breusch-Pagan Test, the Koenker-Bassett Test (also known as the Studentized Breusch-Pagan Test), and the White Test to detect heteroscedasticity.
If the p-value is less than 0.05, we can reject the null hypothesis of homoscedasticity in favor of the alternative hypothesis of heteroscedasticity.
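For illustration, the Breusch-Pagan statistic can be computed from scratch: regress the squared residuals on the predictors and compare \(LM = n R^2\) to a chi-squared distribution. This is a Python sketch with synthetic heteroscedastic data; the report runs the packaged tests in R.

```python
import numpy as np
from scipy.stats import chi2

def breusch_pagan(X, resid):
    """LM form of the Breusch-Pagan test: regress squared residuals on X; LM = n * R^2."""
    u2 = resid ** 2
    beta, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ beta
    r2 = 1 - ((u2 - fitted) ** 2).sum() / ((u2 - u2.mean()) ** 2).sum()
    lm = len(resid) * r2
    df = X.shape[1] - 1          # predictors, excluding the intercept
    return lm, chi2.sf(lm, df)

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 2 + 3 * x + rng.normal(0, 0.1 + x, n)   # error variance grows with x by construction

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
lm, p = breusch_pagan(X, resid)
print(lm, p)  # small p-value -> reject homoscedasticity
```

Because the error standard deviation was built to grow with x, the squared residuals are strongly related to x and the test rejects, as expected.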
Another assumption is Normality of Errors, which assumes that residuals follow a normal distribution—a crucial requirement for valid hypothesis testing and confidence intervals. In R, we used the Jarque-Bera Test.
The p-value determines whether the residuals follow a normal distribution. If the p-value is less than 0.05, we can reject the null hypothesis of normality in favor of the alternative hypothesis of non-normality.
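The Jarque-Bera statistic combines the sample skewness and kurtosis of the residuals. A from-scratch Python sketch on synthetic, deliberately non-normal data (the report itself uses R's packaged test):

```python
import numpy as np
from scipy.stats import chi2

def jarque_bera(resid):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4), compared to a chi-squared with 2 df."""
    n = len(resid)
    z = resid - resid.mean()
    s = z.std()
    skew = (z ** 3).mean() / s ** 3
    kurt = (z ** 4).mean() / s ** 4
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    return jb, chi2.sf(jb, 2)

rng = np.random.default_rng(1)
skewed = rng.exponential(1.0, 500)   # clearly non-normal "residuals"
jb, p = jarque_bera(skewed)
print(jb, p)  # large statistic, tiny p-value -> reject normality
```

Exponential draws are strongly right-skewed, so both the skewness and excess-kurtosis terms inflate the statistic and the null of normality is rejected.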
In this report, we also use R to run spatial lag and spatial error regressions. Spatial lag regression assumes that the value of the dependent variable at one location is associated with the values of that variable at nearby locations, as defined by a spatial weights matrix \(W\) (rook contiguity, queen contiguity, or locations within a certain distance of one another). In our context, the spatial lag model is defined as follows:
\[ \text{LNMEDHVAL} = \rho W \times \text{LNMEDHVAL} + \beta_0 + \beta_1 \times \text{PCTVACANT} + \beta_2 \times \text{PCTSINGLES} + \beta_3 \times \text{PCTBACHMOR} + \beta_4 \times \text{LNNBELPOV100} + \epsilon_i \]
where \(\rho\) is the spatial autoregressive coefficient and \(W \times \text{LNMEDHVAL}\) is the spatially lagged dependent variable, with \(W\) the spatial weights matrix. The other terms are the same as in the OLS regression model.
The spatial error model, on the other hand, assumes that the residuals of the model are spatially autocorrelated. It assumes that the residual at one location is associated with the residuals at nearby locations defined by the spatial weights matrix \(W\), in this case the queen contiguity matrix. The spatial error model is defined as follows:
\[ \text{LNMEDHVAL} = \beta_0 + \beta_1 \times \text{PCTVACANT} + \beta_2 \times \text{PCTSINGLES} + \beta_3 \times \text{PCTBACHMOR} + \beta_4 \times \text{LNNBELPOV100} + \lambda W \times \epsilon + u \]
where \(\lambda\) is the spatial autoregressive coefficient for the error process, \(W \times \epsilon\) is the spatially lagged error term, and \(u\) is a spatially uncorrelated error term. The other terms are the same as in the OLS regression model.
Both spatial error regression and spatial lag regression retain the standard assumptions of OLS regression, including linearity, homoscedasticity, and normality of residuals, except for the assumption of spatial independence among observations. This adjustment allows the models to account for spatial autocorrelation and spatial heterogeneity in the data through either the dependent variable (spatial lag model) or the error term (spatial error model). These two models minimize spatial patterns in residuals that could otherwise lead to biased and inefficient estimates.
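To see how the two specifications differ, their reduced forms can be simulated. This is a Python sketch with a hypothetical chain weights matrix standing in for the queen matrix; the actual models in this report are estimated in R.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
# Row-standardized chain weights (a stand-in for the queen contiguity matrix)
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1
W = W / W.sum(axis=1, keepdims=True)

X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
I = np.eye(n)

# Spatial lag model: y = rho*W*y + X*beta + eps  =>  y = (I - rho*W)^-1 (X*beta + eps)
rho = 0.7
y_lag = np.linalg.solve(I - rho * W, X @ beta + rng.normal(size=n))

# Spatial error model: y = X*beta + eps, eps = lambda*W*eps + u  =>  y = X*beta + (I - lambda*W)^-1 u
lam = 0.7
y_err = X @ beta + np.linalg.solve(I - lam * W, rng.normal(size=n))
```

In the lag model the spatial filter \((I - \rho W)^{-1}\) multiplies everything, including \(X\beta\), so neighbors' predictors spill over into each location's outcome; in the error model it only filters the disturbances.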
We compare the results of spatial lag and spatial error regression with the OLS regression to decide whether the two spatial models perform better than OLS regression based on several criteria: Akaike Information Criterion (AIC), Schwarz Criterion (SC, also known as Bayesian Information Criterion, BIC), Log likelihood, and likelihood ratio test.
The Akaike Information Criterion (AIC) and Schwarz Criterion (SC or BIC) are used to compare the models’ goodness of fit. They work by estimating how much information is lost when a model is used to represent reality. Essentially, they balance how accurate the model is against how complicated it is. A lower AIC or SC score means the model does a better job at this balance.
The log likelihood is a measure used in maximum likelihood estimation, which fits a statistical model to the data by picking the parameter values that make the observed data as likely as possible. The higher the log likelihood, the better the model explains the data.
The Likelihood Ratio Test is used to test whether adding spatial dependence to a model (spatial lag or spatial error) significantly improves the model’s fit compared to the OLS model. For this test, the null hypothesis is that the spatial term does not improve the fit, and the alternative hypothesis is that it does.
To reject the null hypothesis in favor of the alternative hypothesis that the spatial model provides a significantly better fit than OLS, the Likelihood Ratio Test should yield a p-value less than the significance level, typically 0.05. We can then conclude that the spatial model is better than the OLS model; if not, the OLS model is adequate.
Note: the likelihood ratio test is not used to compare the spatial lag and spatial error models with each other, but to compare each spatial model with the OLS model. The Likelihood Ratio Test only works for nested models, meaning that one model is a simplified version of the other: the more complicated model contains all the same terms as the simpler model, plus extra ones. The spatial lag model and the spatial error model are not nested in this way.
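The test itself is simple arithmetic on the two log likelihoods. The values below are hypothetical, not the report's results; one degree of freedom corresponds to the single added spatial parameter (\(\rho\) or \(\lambda\)).

```python
from scipy.stats import chi2

def likelihood_ratio_test(ll_ols, ll_spatial, df=1):
    """LR = 2 * (logLik_spatial - logLik_OLS), compared to chi-squared with df degrees of freedom."""
    lr = 2 * (ll_spatial - ll_ols)
    return lr, chi2.sf(lr, df)

# Hypothetical log likelihoods: OLS vs. a spatial model with one extra parameter
lr, p = likelihood_ratio_test(-700.0, -650.0)
print(lr, p)  # LR = 100 -> p far below 0.05, so the spatial model fits significantly better
```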
Alternatively, we can also compare the spatial models to OLS using the Moran’s I statistic, which measures the spatial autocorrelation of the residuals. Moran’s I ranges from -1 to 1, where -1 indicates perfect dispersion, 0 indicates no spatial autocorrelation, and 1 indicates perfect clustering. Our goal in using spatial models is to minimize the spatial autocorrelation of the residuals. If the Moran’s I of the residuals of the spatial model is closer to 0 than that of the OLS model, then the spatial model is better at minimizing spatial autocorrelation, and we can conclude that it better captures the spatial dependencies in the data.
We also conduct Geographically Weighted Regression (GWR) analysis in R. Geographically Weighted Regression is a form of local regression that helps address spatial heterogeneity in data, which is essential when analyzing spatial data prone to Simpson’s Paradox, a phenomenon where trends identified in aggregated data may differ from those found within smaller subsets of the data. GWR allows us to examine the relationships at a local level rather than assuming they are uniform across the study area. The general GWR model is defined as follows:
\[ y_i = \beta_{i0} + \sum_{k=1}^{m} \beta_{ik}x_{ik} + \epsilon_i \] where \(y_i\) is the dependent variable at location \(i\), \(\beta_{i0}\) is the local intercept at location \(i\), \(\beta_{ik}\) is the local coefficient for the \(k\)-th predictor at location \(i\), \(x_{ik}\) is the value of the \(k\)-th predictor at location \(i\), and \(\epsilon_i\) is the error term.
In Geographically Weighted Regression (GWR), local regression is performed by fitting a regression model at each observation point, using a subset of neighboring points. These neighbors are weighted according to their distance from the focal point. The bandwidth controls the number of neighbors used in each regression, which influences the degree of locality in the model. A smaller bandwidth results in a more localized model, while a larger bandwidth results in a more global model.
There are two types of bandwidths: adaptive and fixed. A fixed bandwidth uses a constant distance for all points, while an adaptive bandwidth adjusts dynamically, ensuring a consistent number of neighbors for each regression point regardless of variations in data density. In this case, we use an adaptive bandwidth, which is more appropriate as it accounts for varying spatial densities in the data. This adaptive method offers greater flexibility, allowing the model to better capture local relationships in areas with differing population distributions.
Although the GWR model allows for spatial variation in relationships, the standard OLS assumptions, including linearity, independence of observations, homoscedasticity, and normality of residuals, still apply. Multicollinearity is assessed using the condition number; high multicollinearity can lead to unstable estimates and clustering in parameter estimates. It is also important to note that GWR does not provide p-values for the coefficients, as the model focuses on exploring spatial patterns rather than testing global hypotheses.
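The mechanics of GWR reduce to a weighted least squares fit at each location. Below is a minimal fixed-bandwidth Gaussian-kernel sketch in Python on synthetic data whose true slope drifts across space; the report's actual GWR uses an adaptive bandwidth in R, so this is for intuition only.

```python
import numpy as np

def gwr_fit(coords, X, y, bandwidth):
    """Weighted least squares at each location with a Gaussian distance kernel."""
    betas = np.empty((len(y), X.shape[1]))
    for i in range(len(y)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-(d / bandwidth) ** 2)
        Xw = X * w[:, None]                       # rows of X scaled by the kernel weights
        betas[i] = np.linalg.solve(Xw.T @ X, Xw.T @ y)   # solve (X'WX) b = X'Wy
    return betas

rng = np.random.default_rng(5)
n = 100
coords = np.column_stack([np.arange(n, dtype=float), np.zeros(n)])
x = rng.normal(size=n)
slope = np.arange(n) / n                  # the true slope drifts from 0 to ~1 across space
y = slope * x + rng.normal(0, 0.05, n)
X = np.column_stack([np.ones(n), x])

betas = gwr_fit(coords, X, y, bandwidth=10)
print(betas[0, 1], betas[-1, 1])  # local slope estimates rise across the study area
```

A global OLS fit would return a single averaged slope and hide this spatial variation, which is exactly the heterogeneity GWR is designed to surface.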
The Global Moran’s I analysis for the dependent variable,
LNMEDHVAL (the natural log of median house value) reveals a
significant level of spatial autocorrelation. With a Moran’s I value of
0.794, the data exhibits strong positive spatial
autocorrelation, indicating that areas with similar median house values
tend to cluster together. This suggests that high-value areas are
surrounded by other high-value areas, while low-value areas are
surrounded by other low-value areas.
queen<-poly2nb(data, row.names=data$POLY_ID)
queenlist<-nb2listw(queen, style = 'W')
moran(data$LNMEDHVAL, queenlist, n=length(queenlist$neighbours), S0=Szero(queenlist))$`I`
## [1] 0.794
To validate the significance of this observed spatial autocorrelation, a Monte Carlo permutation test was conducted using 1000 simulations. This approach involved randomly permuting the values of LNMEDHVAL across spatial locations to generate a distribution of Moran’s I values under the null hypothesis of no spatial autocorrelation. The results, shown in the histogram below, indicate that the observed Moran’s I value of 0.794 is significantly higher than the values generated from random permutations, emphasizing the presence of spatial autocorrelation in the data.

The test yields a p-value less than 0.001. This exceptionally low p-value rejects the null hypothesis of no spatial autocorrelation, confirming that the spatial autocorrelation in LNMEDHVAL is statistically significant.
##
## Monte-Carlo simulation of Moran I
##
## data: data$LNMEDHVAL
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = 0.8, observed rank = 1000, p-value <0.0000000000000002
## alternative hypothesis: two.sided
ggplot(data.frame(res = globalmoranMC$res), aes(x = res)) +
geom_histogram(bins = 100, fill = "#283d3b") +
geom_vline(xintercept = globalmoranMC$statistic, color = "#c44536", linetype = 'dashed', size = 1) +
labs(title = "Observed and Permuted Global Moran's I",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

We also create a scatterplot to visualize the relationship between the dependent variable (log-transformed median house value) and its spatial lag. The x-axis represents the logged median house values (LNMEDHVAL), while the y-axis displays the spatially lagged values of LNMEDHVAL, calculated using the queen contiguity spatial weights matrix that considers neighboring spatial units. If no spatial autocorrelation were present, the plot would display no clear pattern. However, the plot shows a positive linear relationship between the logged median house values and their spatial lags, indicating positive spatial autocorrelation: areas with higher median house values are surrounded by other high-value areas. These results support the findings of the Global Moran’s I analysis, which identified strong positive spatial autocorrelation in the dependent variable.
ggplot(data = data.frame(
LNMEDHVAL = data$LNMEDHVAL,
spatial_lag = lag.listw(queenlist, data$LNMEDHVAL)
), aes(x = LNMEDHVAL, y = spatial_lag)) +
geom_point(color = "#283d3b", alpha = 0.7, size = 0.6) +
geom_smooth(method = "lm", color = "#c44536", se = FALSE) +
labs(title = "Global Moran's I Scatter Plot",
x = "Logged Median House Value",
y = "Spatial Lag of LNMEDHVAL") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))

Then, we analyze the Local Moran’s I values for the
dependent variable LNMEDHVAL. Unlike Global Moran’s I,
which looks at the overall pattern, Local Moran’s I helps us find
specific areas where values are similar to or different from nearby
areas. This allows us to identify clusters of high or low values and
unusual areas that stand out. We calculate Local Moran’s I for each
census block group to see if there are significant patterns in
transformed median housing values.
# d. Local Moran's I (LISA analysis) for LNMEHVAL
lmoran<-localmoran(data$LNMEDHVAL, queenlist)
localmoran <-cbind(data, as.data.frame(lmoran))

To visualize the results, we created a significance map showing the p-values of Local Moran’s I. The p-value tells us whether the spatial autocorrelation at a location is statistically significant. Lower p-values indicate areas where the spatial autocorrelation is statistically significant, either hotspots where high values concentrate together or coldspots where low values concentrate together. In the map, dark red (p ≤ 0.001), red (p ≤ 0.01), and light pink (p ≤ 0.05) indicate areas where similar median housing values are grouped together, showing a significant spatial pattern. In contrast, grey (p > 0.05) represents areas without a strong pattern, where median housing values are more randomly distributed. In Philadelphia, the map reveals notable clustering in areas including Center City, City Line Avenue, Roxborough and Germantown, and Far Northeast Philadelphia, indicating statistically significant spatial autocorrelation in these regions.
moranSig.plot <- function(df, listw, title) {
local <- localmoran(x = df$LNMEDHVAL, listw = listw, zero.policy = FALSE)
df$Pr.z <- local[, "Pr(z != E(Ii))"]
df$pval_category <- cut(df$Pr.z,
breaks = c(0, 0.001, 0.01, 0.05, 1),
labels = c("0.000 - 0.001", "0.001 - 0.010", "0.010 - 0.050", "0.050 - 1.000"),
include.lowest = TRUE)
if (!inherits(df, "sf")) {
df <- st_as_sf(df)
}
ggplot(data = df) +
geom_sf(aes(fill = pval_category), color = NA, alpha = 0.9) +
scale_fill_manual(values = c("0.000 - 0.001" = "#800f2f",
"0.001 - 0.010" = "#ff4d6d",
"0.010 - 0.050" = "#ffccd5",
"0.050 - 1.000" = "grey"),
name = "P-Value") +
labs(title = title) +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 20, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
}
moranSig.plot(localmoran, queenlist, 'Significance Map of Local Moran I')
[Figure: Philadelphia County Map. Source: United Inspection Agency]
To visualize the patterns more clearly, we generated a Cluster Map to identify different types of clustering. The Cluster Map, derived from the Local Moran’s I analysis, classifies census block groups in Philadelphia as high-high, high-low, low-high, low-low, or not significant. This map helps us analyze how housing values are distributed and whether they form significant spatial clusters.
High-High (HH) Clusters: Center City, Roxborough, Germantown, and Far Northeast Philadelphia showed High-High (HH) clusters, meaning areas with high median housing values are surrounded by other high-value areas. These regions typically indicate concentrated economic activity and affluent neighborhoods. They maintain high house values due to strong demand and, possibly, the presence of amenities and other attractive urban features.
Low-Low (LL) Clusters: City Line Avenue, the northern part of North Philadelphia, the northeastern part of University City, and the northwestern part of South Philadelphia exhibited Low-Low (LL) clusters, where areas with low median housing values are surrounded by other low-value areas. These clusters may indicate economically underdeveloped or lower-income neighborhoods. These regions may require targeted policy interventions to address underlying issues that contribute to lower housing values.
High-Low (HL) Clusters: Three census blocks in South Philadelphia showed High-Low (HL) clusters, where high-value areas are surrounded by low-value areas. These locations represent spatial outliers, potentially indicating newly developed commercial or residential projects in historically low-income or gentrifying neighborhoods.
Low-High (LH) Clusters: Three scattered census blocks in Philadelphia exhibited Low-High (LH) clusters, meaning areas with low median housing values are surrounded by high-value areas. These clusters may correspond to historically preserved districts with strict development regulations or industrial or undeveloped land within high-value urban centers.
Non-Significant Clusters: The remaining areas in Philadelphia, shown in white on the map, were classified as Non-Significant, meaning the local Moran’s I statistic was not significant there. These regions are scattered throughout the city where house values do not exhibit significant spatial clustering patterns. These areas may have more diverse housing markets or more heterogeneous economic characteristics compared to the clustered regions.
hl.plot <- function(df, listw) {
local <- localmoran(x = df$LNMEDHVAL, listw = listw, zero.policy = FALSE)
quadrant <- vector(mode = 'numeric', length = nrow(df))
m.prop <- df$LNMEDHVAL - mean(df$LNMEDHVAL)
m.local <- local[, 1] - mean(local[, 1])
signif <- 0.05
quadrant[m.prop > 0 & m.local > 0] <- 1 # high-high
quadrant[m.prop < 0 & m.local < 0] <- 2 # low-low
quadrant[m.prop < 0 & m.local > 0] <- 4 # low-high
quadrant[m.prop > 0 & m.local < 0] <- 3 # high-low
quadrant[local[, 5] > signif] <- 5 # insignificant
df$quadrant <- factor(quadrant, levels = c(1, 3, 5, 2, 4),
labels = c("High-High", "High-Low", "Non-Significant", "Low-Low", "Low-High"))
if (!inherits(df, "sf")) {
df <- st_as_sf(df)
}
ggplot(data = df) +
geom_sf(aes(fill = quadrant), color = "#848884", lwd = 0.07) +
scale_fill_manual(values = c("High-High" = "#800f2f",
"High-Low" = "#ff4d6d",
"Non-Significant" = "white",
"Low-Low" = "#8d99ae",
"Low-High" = "#2b2d42"),
name = "Cluster Type") +
labs(title = "Local Moran's I Cluster Map") +
theme(legend.position="right",
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 20, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
}
hl.plot(data, queenlist)
# e. OLS Regression Analysis
reg<-lm(LNMEDHVAL ~ LNNBELPOV+PCTBACHMOR+PCTSINGLES+PCTVACANT, data=data)
summary(reg)
##
## Call:
## lm(formula = LNMEDHVAL ~ LNNBELPOV + PCTBACHMOR + PCTSINGLES +
## PCTVACANT, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2582 -0.2039 0.0382 0.2174 2.2434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.113778 0.046532 238.84 < 0.0000000000000002 ***
## LNNBELPOV -0.078903 0.008457 -9.33 < 0.0000000000000002 ***
## PCTBACHMOR 0.020910 0.000543 38.49 < 0.0000000000000002 ***
## PCTSINGLES 0.002977 0.000703 4.23 0.000024 ***
## PCTVACANT -0.019156 0.000978 -19.59 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.366 on 1715 degrees of freedom
## Multiple R-squared: 0.662, Adjusted R-squared: 0.662
## F-statistic: 841 on 4 and 1715 DF, p-value: <0.0000000000000002
# g. OLS residuals plotted
data$OLS_RESIDU<-rstandard(reg)
data$WT_RESIDU<-sapply(queen, function(x) mean(data$OLS_RESIDU[x]))
OLS.Residuals.Map<-tm_shape(data)+
tm_fill(col='OLS_RESIDU', style='quantile', title='Standardized OLS Residuals',
palette ='Blues')+
tm_layout(frame=FALSE, title = 'Standardised OLS Residuals')
# scatterplot of OLS_RESIDU by WT_RESIDU
ggplot(data, aes(x = WT_RESIDU, y = OLS_RESIDU)) +
geom_point(color = "blue", alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Scatter Plot of OLS Residuals vs. Weighted Residuals",
x = "Weighted Residuals (WT_RESIDU)",
y = "OLS Residuals (OLS_RESIDU)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Run simple regression of OLS_RESIDU on WT_RESIDU
lm_residuals <- lm(OLS_RESIDU ~ WT_RESIDU, data = data)
summary(lm_residuals)
##
## Call:
## lm(formula = OLS_RESIDU ~ WT_RESIDU, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.368 -0.445 0.058 0.462 5.443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0128 0.0212 -0.6 0.55
## WT_RESIDU 0.7323 0.0324 22.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.879 on 1718 degrees of freedom
## Multiple R-squared: 0.229, Adjusted R-squared: 0.228
## F-statistic: 510 on 1 and 1718 DF, p-value: <0.0000000000000002
# h. Moran’s I of the OLS regression residuals
##
## Monte-Carlo simulation of Moran I
##
## data: data$OLS_RESIDU
## weights: queenlist
## number of simulations + 1: 1000
##
## statistic = 0.3, observed rank = 1000, p-value <0.0000000000000002
## alternative hypothesis: two.sided